Most techniques for achieving syntactic category disambiguation are
statistical. They hinge on the frequency of a word occurring as a
particular part of speech (derived from corpus analysis), in
combination with the probability of the word being a particular part
of speech given the categories of the words around
it. Such techniques can achieve remarkable accuracy (in the 96-97%
range). Recent experience with the CLAWS4 tagger designed for the
British National Corpus (Leech 1994, Leech 1997) suggests, however,
that the accuracy can only be improved through the addition of
non-probabilistic rules which define highly specific contexts in
which a certain part of speech is more likely for a particular
(syntactically) ambiguous lexical form. These rules capture lexical
regularities, such as idioms, naming expressions, and noun compounds
(Leech 1997). Achieving a few extra percentage points of accuracy
requires investing a lot of effort into defining specific rules, but
clearly the lexicon can play an important role in achieving
near-perfect syntactic category tagging.
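The combination of lexical frequency and contextual probability described above can be sketched as follows. This is a minimal illustration only: the probability tables are invented rather than corpus-derived, the names LEXICAL, TRANSITION, FLOOR and tag are hypothetical, and scoring a tag by multiplying a word-given probability with a tag-to-tag transition probability is a simplification of what taggers such as CLAWS4 actually do.

```python
# Toy probabilities, invented for illustration (not corpus-derived).
# LEXICAL[word][tag]: how often the word occurs with that part of speech.
LEXICAL = {
    "the": {"DET": 1.0},
    "can": {"AUX": 0.9, "NOUN": 0.1},
    "leaks": {"VERB": 0.6, "NOUN": 0.4},
}
# TRANSITION[(previous_tag, tag)]: how likely a tag is after the previous one.
TRANSITION = {
    ("START", "DET"): 0.6,
    ("DET", "NOUN"): 0.95,
    ("DET", "AUX"): 0.01,
    ("NOUN", "VERB"): 0.5,
    ("NOUN", "NOUN"): 0.2,
    ("AUX", "VERB"): 0.8,
}
FLOOR = 1e-6  # assumed score for tag pairs missing from the table


def tag(words):
    """Return the highest-scoring tag sequence (Viterbi search over bigrams)."""
    # best[t] holds (score, path) for the best path so far ending in tag t.
    best = {"START": (1.0, [])}
    for word in words:
        step = {}
        for t, p_lex in LEXICAL[word].items():
            step[t] = max(
                ((score * TRANSITION.get((prev, t), FLOOR) * p_lex, path + [t])
                 for prev, (score, path) in best.items()),
                key=lambda cand: cand[0],
            )
        best = step
    return max(best.values(), key=lambda cand: cand[0])[1]


print(tag(["the", "can", "leaks"]))
```

Taken in isolation, "can" is most frequently an auxiliary in this toy table, but the preceding determiner makes the noun reading win in context.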
Systems in the first group utilise grammar rules, generally derived from a theory of linguistic structure, to guide the processing of an input string. For this approach the lexicon must contain information in addition to morphological form and syntactic category. Subcategorisation information is generally required to help identify groups of words which form a phrase. This information is also useful for circumscribing the complexity of the disambiguation task, in that a word might be associated with different subcategorisation frames in its different meanings, so that the syntactic context can constrain which lexical entry is appropriate.
The grammar rules will take on different forms depending on the
theoretical framework of the specific grammar. In unification-based
grammars, for example, grammar rules make extensive use of features
(and their values) represented in the lexical entries of words,
including such features as case, gender, and tense. The rules
themselves will dictate how the feature structures representing
different kinds of words can be combined into feature structures
representing phrases, and then how phrases can be combined. Parsing
strategies with a grammar can proceed either top-down, i.e. by
predicting the category of word to look for according to the grammar
rules, or bottom-up, i.e. allowing the lexical information
associated with the current word to drive the choice of rule to
attempt to instantiate. Carroll (1994) points out that
bottom-up parsing is more attractive for natural language grammars,
given that highly specific syntactic (and semantic) information can
be represented in the lexical entries for particular words, which
will severely restrict the number of potential rules applicable at
any point in the parsing. Some bottom-up parsers are also augmented
with top-down prediction to improve performance, resulting in a
hybrid strategy. It is clear, however, that the richer the lexical
information available to the grammar, the fewer ambiguous parses it
will generate. The utility of increased lexical representation
is shown through the success of lexicalist grammar frameworks like
HPSG for computation.
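To make concrete how features in lexical entries restrict rule application, the following is a minimal sketch of feature structure unification over nested dictionaries. It is an assumption-laden simplification: real unification-based formalisms such as HPSG use typed feature structures with structure sharing (reentrancy), which this sketch omits, and the entries SHE, THEM and SLEEPS are hypothetical.

```python
def unify(fs1, fs2):
    """Unify two feature structures (nested dicts); return None on a clash."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature not in result:
            result[feature] = value
        elif isinstance(result[feature], dict) and isinstance(value, dict):
            sub = unify(result[feature], value)
            if sub is None:
                return None
            result[feature] = sub
        elif result[feature] != value:
            return None  # incompatible atomic values
    return result


# Hypothetical lexical entries.
SHE = {"cat": "NP", "case": "nom",
       "agr": {"person": 3, "number": "sg", "gender": "fem"}}
THEM = {"cat": "NP", "case": "acc",
        "agr": {"person": 3, "number": "pl"}}
# The verb's entry states what it requires of its subject.
SLEEPS = {"cat": "VP",
          "subj": {"cat": "NP", "case": "nom",
                   "agr": {"person": 3, "number": "sg"}}}

print(unify(SLEEPS["subj"], SHE))   # succeeds, adding the gender feature
print(unify(SLEEPS["subj"], THEM))  # None: case (and number) clash
```

The more specific the requirements recorded in the verb's entry, the more candidate combinations fail to unify, which is the sense in which rich lexical entries prune the rules a bottom-up parser needs to try.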
This reduction in the number of ambiguous parses, however, comes
at the cost of an increase in complexity during parsing -- if
there are more lexical entries for a particular word, more
ambiguities will have to be entertained during parsing. The problem
of discriminating senses/lexical entries must therefore be balanced
with the reduction of structural ambiguity.
Statistical approaches come in a variety of forms, ranging from hybrid models in which grammars (such as those described above) are augmented with rule application probabilities derived from a corpus (e.g. Carroll and Briscoe 1992, Brew 1995, Abney 1996) to parsers which utilise no grammar rules whatsoever, but derive a probabilistic model of syntactic derivation from a corpus of labelled syntactic tree structures, analysed for the occurrence frequencies of tree fragments (e.g. the data-oriented processing approach of Bod and Scha 1996). Both kinds of approach aim to identify the most probable parse of a sentence, given that a single sentence can often be associated with multiple syntactic structures. The former type requires an amount of lexical information dictated by the specificity of the grammar rules (as above), while the latter type requires only a finite set of terminal nodes for the syntactic trees, corresponding to word forms.
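The idea of selecting the most probable parse with rule application probabilities can be illustrated with a small sketch. The grammar, the probabilities in RULE_PROB and the two candidate trees are invented for illustration (a prepositional phrase attaching either to the verb or to the object noun phrase), and lexical probabilities are ignored for brevity.

```python
# Hypothetical rule probabilities; the rules for each left-hand side sum to 1.
RULE_PROB = {
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("V", "NP", "PP")): 0.4,
    ("NP", ("Det", "N")): 0.7,
    ("NP", ("NP", "PP")): 0.3,
    ("PP", ("P", "NP")): 1.0,
}


def parse_probability(tree):
    """Multiply the probabilities of all rules used in a tree.

    A tree is a tuple (label, child, ...); a bare string is a preterminal.
    """
    if isinstance(tree, str):
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    probability = RULE_PROB[(label, rhs)]
    for child in children:
        probability *= parse_probability(child)
    return probability


NP = ("NP", "Det", "N")
PP = ("PP", "P", NP)
VERB_ATTACH = ("VP", "V", NP, PP)          # the PP modifies the verb
NOUN_ATTACH = ("VP", "V", ("NP", NP, PP))  # the PP modifies the object

best = max([VERB_ATTACH, NOUN_ATTACH], key=parse_probability)
print(best is VERB_ATTACH)
```

With these particular numbers the verb attachment scores 0.4 x 0.7 x 0.7 = 0.196 against 0.6 x 0.3 x 0.7 x 0.7 = 0.0882 for the noun attachment, so the former is selected.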
The questions for this latter type, however, are what guides the initial syntactic analysis of the corpus of syntactic trees from which the statistical model is derived, and how the lexical categories for the word forms in the corpus are determined (i.e. what is the bootstrapping mechanism for acquiring such a model of language structure?). This approach will also have difficulty coping with unknown words in the text to be parsed. In addition, the ambiguities arising from words which appear in multiple syntactic categories, and from words whose senses appear in distinct syntactic environments, may result in inaccurate models. How these issues are to be addressed still needs to be investigated; it is likely that the addition of semantic annotations to the syntax trees in the corpus will aid disambiguation during parsing (Bod and Scha 1996).
Increased lexical information and constraints on semantic composition derived from a theoretical framework are likely to improve the accuracy of such statistical models, as they suffer from the problem of unseen (or zero) data. The issue of how to treat an unknown word arises from the fact that statistical training sets can never encompass every possible word. If the system assigns zero probability to unknown words, it will rule out words which might be licensed linguistically. If it instead assigns a small probability to unknown words, it will allow words which are linguistically impossible. The key to addressing this is to combine the statistical models with a linguistic framework which provides some predictive mechanisms for word formation and sense extensions, as well as constraints on these mechanisms (e.g. as proposed by Briscoe and Copestake briscoe_copestake:96).
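The dilemma posed by unseen data can be shown with a small sketch, assuming a unigram model over a toy training text. The functions mle and add_one are hypothetical names; add-one smoothing is used here only as the simplest example of assigning a small probability to unknown words, not as the method of any of the systems cited.

```python
from collections import Counter

counts = Counter("the dog barks the dog sleeps".split())
# Assumed vocabulary size: the observed word types plus one slot for
# any unknown word.
VOCAB_SIZE = len(counts) + 1


def mle(word):
    """Relative frequency estimate: unseen words get probability zero."""
    return counts[word] / sum(counts.values())


def add_one(word):
    """Add-one smoothed estimate: every unseen word gets the same small mass."""
    return (counts[word] + 1) / (sum(counts.values()) + VOCAB_SIZE)


print(mle("barked"))      # 0.0: a linguistically licensed form is ruled out
print(add_one("barked"))  # small but non-zero
print(add_one("xqzt"))    # the same value for an impossible form
```

The smoothed model cannot tell the well-formed "barked" from the ill-formed "xqzt"; a linguistic account of word formation is what would separate them.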
Krovetz (1991) highlights lexical ambiguity as problematic for IR: many of the documents retrieved in response to a query will not be relevant, because the sense of a word intended in the query differs from the sense of the same word in the document context. He proposes to combat this problem by making much more elaborate use of the lexicon: indexing documents according to word senses rather than words, and grouping synonym lists (which are necessary to identify relevant documents which may not contain the exact words in the query) according to word senses. He notes, however, that improvements can be gained without identifying the single correct sense of a word, by instead ruling out as many incorrect senses as possible. This means that even coarse sense disambiguation (at the level of homonymy) can be useful for IR; we will see below (Section 6.4) that this can be achieved through statistical measures of semantic relatedness of the context to a word sense.
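The strategy of ruling out incorrect senses, rather than selecting a single correct one, might be sketched as below. This is not Krovetz's procedure: the sense inventory SENSES, its signature word sets, and the simple overlap measure are all assumptions standing in for a real lexicon and a real measure of semantic relatedness.

```python
# Hypothetical sense inventory: each sense has a set of related words.
SENSES = {
    "bank": {
        "bank/finance": {"money", "loan", "account", "deposit"},
        "bank/river": {"river", "water", "shore", "fishing"},
        "bank/flight": {"aircraft", "turn", "wing"},
    }
}


def surviving_senses(word, context):
    """Keep the senses whose signature shares a word with the context.

    If no sense overlaps with the context, nothing can be ruled out,
    so all senses are returned.
    """
    context = set(context)
    overlaps = {sense: len(signature & context)
                for sense, signature in SENSES[word].items()}
    kept = [sense for sense, n in overlaps.items() if n > 0]
    return kept or list(overlaps)


print(surviving_senses("bank", ["loan", "river", "rate"]))
```

In this example the context does not decide between the financial and the river sense, but it does eliminate the flight sense, which is already enough to narrow the index entries or synonym lists consulted.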
Unfortunately, subsequent investigation into the performance benefits of adding word sense disambiguation to IR systems reveals varied judgements on its utility. Kilgarriff (1997b) summarises reports of improved performance resulting from word sense disambiguation as ranging from 2% to 14% in IR systems. Kilgarriff quotes Karen Sparck Jones as saying:
``The fact is, that in relation to IR research as a whole, there's rather little work on WSD just precisely because, as you remark, mechanisms that are independently desirable for improving query quality, namely (in general) adding more query terms to increase the number of conjoint matches, also incidentally achieve disambiguation.''
In other words, the ambiguity of word meaning is only an issue for very short queries, which do not provide any disambiguating terms, but not for longer queries in which the combination of multiple terms specifies the desired document context more precisely. Given these facts, the lexicon may not need to be complexly structured for the task of IR. Yet it may turn out that more lexical information could ultimately improve performance; after all, the texts which are to be retrieved are written in natural language and a more sophisticated interpretation would very likely lead to more relevant retrieval. The potential impact on the IR performance will need to be balanced with the complexity of increasing linguistic analysis of the texts.
Furthermore, identification of the context is critical for accurate
translation of words in the source language which correspond to
several different possible words in the target language, and of words
ambiguous in the source language for which each sense corresponds to
a different word in the target language. However, translation
equivalents which are associated with the same set of senses in both
source and target language, i.e. which are ambiguous in the same
ways in both languages, do not need to be disambiguated for the
purposes of translation, since the system does not face a
correspondence problem in these cases. So the disambiguation
problem for Machine Translation applications is not as serious as
for general natural language understanding, but word senses must be
taken into consideration in the design of the multilingual lexicon.
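The observation that ambiguity shared by source and target language need not be resolved can be sketched as a check on a bilingual lexicon. The TRANSFER table is a toy English-to-French example invented for illustration, and the sense labels are hypothetical.

```python
# Hypothetical transfer lexicon: source word -> {source sense: target word}.
TRANSFER = {
    "bank": {"finance": "banque", "river": "rive"},
    "operation": {"surgery": "opération", "military": "opération"},
}


def needs_disambiguation(word):
    """A word must be disambiguated only if its senses map to different
    target words; shared ambiguity carries over unchanged."""
    return len(set(TRANSFER[word].values())) > 1


print(needs_disambiguation("bank"))       # True
print(needs_disambiguation("operation"))  # False
```

A multilingual lexicon organised by sense in this way lets the system detect in advance which source words pose a correspondence problem and which can be translated without sense selection.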
These systems vary in the specificity of the domains to which they are applied, but it seems clear that the more general the domain, the more important is the capacity of the system to construct an interpretation of the text, relevant to user needs and knowledge, from which inferences can be drawn. This follows from the observation that language use in specific domains is more restricted in terms of both vocabulary and syntactic structure than in general language. Applications working within specific domains can therefore employ heuristics geared towards language use specific to the domain. More general understanding, on the other hand, cannot be done without making use of general linguistic principles and accurate identification of intended word meaning. Lexical structure needs in particular to reflect semantic relations such as hyperonymy/hyponymy (i.e. the hierarchical relations between words) and synonymy/antonymy in order to capture generalisations about the ways in which similar words can be used.
Kilgarriff (1997b) claims that word sense ambiguity is not much of a problem for natural language understanding systems, given that such systems are domain specific and generally have domain models. He says, ``It is generally necessary to have a detailed knowledge of the word senses that are in the domain, so the knowledge to disambiguate will often be available in the domain model even where it has not explicitly been added for disambiguation purposes.'' I find his conclusion unfounded, however, since regardless of where the information needed to disambiguate comes from, disambiguation must happen and can prove to involve arbitrarily complex reasoning. The more lexical structure that exists, the less reasoning is necessary in the domain model or the pragmatic component of the system. A balance must be found between the load which disambiguation imposes on pragmatics and the complexity of the structure of the lexicon. Furthermore, in dialogue systems, the system has virtually no control over what the user says and a domain-restricted lexicon will almost certainly not be able to properly interpret the full range of user input.
Those working on NLU applications may find that a hybrid approach
combining statistical techniques with linguistic analyses is the
most effective and efficient way of addressing the demands of
general language understanding. This is in fact what I will argue
for in my discussion of lexical acquisition
(Section 6.6). In such an approach the lexicon is likely
to play a key role in processing, as the repository of word- and
phrase-specific morphological, syntactic, and semantic information.
Interpretation of a sentence cannot proceed without both an accurate
representation of its syntactic relationships, derived through
parsing, and mechanisms for combining the
meaning contribution of the words and phrases in the sentence. The
semantic contribution of words needs to be identified from the
lexicon; to a certain degree the combination of word meanings into a
coherent structure can be governed by the specifications in the
lexicon as well. The problem of word sense disambiguation must be
addressed from a task-specific perspective: how precise must the
determination of the intended word sense be in order to generate an
appropriate response? The solution must also take into
consideration the distribution of load on various modules of the NLU
system: how much disambiguation should occur at the level of
selecting an appropriate lexical entry, and how much should it rely
on discourse, world knowledge, or domain-level processing to
refine the sense associated with an (underspecified) lexical entry?
The answer to this question will stem from the needs of the task and
the framework which drives the linguistic processing.
The solution for linguistic realisation will need to rely heavily on a rich lexicon of syntactic and semantic information, governed by constraints on the combination of words into sentences and discourse coherence factors. Statistical techniques cannot suffice for generating sentences which are not only plausible syntactically, but contextually relevant, because they capture relationships between words and phrases without having a notion of the underlying structure of those relationships, or the motivations for highlighting such a relationship. Only a solid linguistic framework can guide such processing adequately.
The problem of polysemy must also be carefully addressed within NLG systems, since the range of meanings a word can take on in specific syntactic environments is directly relevant to the problem of forming a grammatically correct and easily understandable sentence using that word. If an NLG system attempts to circumscribe the problem of lexical choice by having a sparse, domain-specific lexicon, it may inadvertently produce something which a user might interpret in a different way or as ambiguous. The particular meaning actually conveyed by a word in context must be no more and no less informative than the sense intended by the NLG system in order to avoid discrepancies between what the user understands and what the system ``believes'' it has told the user. To tackle these issues, the system must incorporate mechanisms for sense disambiguation which model those of the users, and must therefore look to linguistic theory for insights into the most effective lexical structure and lexical reasoning techniques governing word sense disambiguation.
Statistical information may prove useful in NLG to a certain extent, as an addition to linguistically derived mechanisms. One can imagine a lexicon augmented with probabilities indicating the frequency with which a word appears with a particular meaning; extremely low probability word senses should then generally be dispreferred in generation. This information could thus influence the way the lexicon is used in addressing the polysemy problem in NLG tasks.
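One way to picture such a probability-augmented lexicon is the following sketch. The table SENSE_PROB, its figures, the threshold value and the function choose_word are all hypothetical; the point is only that a word whose intended sense is rare is dispreferred as a realisation of that meaning.

```python
# Hypothetical lexicon: word -> {sense: relative frequency of that sense}.
SENSE_PROB = {
    "bank": {"financial_institution": 0.85, "river_edge": 0.15},
    "shore": {"river_edge": 0.9, "prop_up": 0.1},
}


def choose_word(meaning, threshold=0.2):
    """Pick the word that most often carries the intended meaning.

    Words for which the meaning is a low-probability sense (below the
    assumed threshold) are not considered; None means no suitable word.
    """
    candidates = [(probs[meaning], word)
                  for word, probs in SENSE_PROB.items()
                  if probs.get(meaning, 0.0) >= threshold]
    return max(candidates)[1] if candidates else None


print(choose_word("river_edge"))             # shore
print(choose_word("financial_institution"))  # bank
print(choose_word("prop_up"))                # None
```

Here "bank" is rejected as a way of expressing the river-edge meaning because a reader is far more likely to take it in its financial sense.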
The previous discussion has highlighted the differing needs for lexical representation among various NLP systems. This representation can range from a very shallow list of morphological forms to a highly structured and fine-grained lexicon which derives from linguistic theory. I emphasised the task-dependency of the choice of lexical representation and structure -- the greater the level of understanding required of an NLP system, the more important the issues stemming from polysemy become and the more attention must be paid to the lexicon. I will review the problems underlying the representation of polysemy in Section 6.3, and will consider existing approaches to word sense disambiguation in Section 6.4.
How the lexicon can be built up and how it can be modified and updated when necessary, however, must be considered in the design of any computational lexicon. I will therefore discuss the problem of lexical acquisition in Section 6.5.